Using Association Rules of the Online Retail Dataset
1 Load Data
We first want to load our datasets and prepare them for some simple association rules mining.
tnx_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")
tnx_data_tbl %>% glimpse()## Rows: 1,021,424
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
To use our rules mining we just need the invoice data and the stock code, so we can ignore the rest. Also, we ignore the issue of returns and just look at purchases.
tnx_purchase_tbl <- tnx_data_tbl %>%
filter(
quantity > 0,
exclude == FALSE
) %>%
select(invoice_id, stock_code, customer_id, quantity, price, stock_value, description)
tnx_purchase_tbl %>% glimpse()## Rows: 994,628
## Columns: 7
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434"…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "2…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "130…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3,…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 3…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS…
We now write this data out as a CSV so arules can read it in and process it.
tnx_purchase_tbl %>% write_csv("data/tnx_purchase_tbl.csv")We also want to load the free-text description of the various stock items as this will help will interpretation of the various rules we have found.
stock_code_lookups_tbl <- read_rds("data/stock_code_lookup_tbl.rds")
stock_code_lookups_tbl %>% glimpse()## Rows: 4,733
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10109", "10120", "10123C",…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", …
2 Basket Analysis with Association Rules
We now want to do some basic basket analysis using association rules, which tries to determine which items get bought together, similar to taking a graph approachin many ways.
basket_arules <- read.transactions(
file = "data/tnx_purchase_tbl.csv",
format = "single",
sep = ",",
header = TRUE,
cols = c("invoice_id", "stock_code")
)
basket_arules %>% glimpse()## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## ..@ itemInfo :'data.frame': 4965 obs. of 1 variable:
## .. ..$ labels: chr [1:4965] "10002" "10002R" "10080" "10109" ...
## ..@ itemsetInfo:'data.frame': 41410 obs. of 1 variable:
## .. ..$ transactionID: chr [1:41410] "489434" "489435" "489436" "489437" ...
Now that we have this data we can look at some basic plots much like we produced before. For example, we can look at the relative frequency of the different items.
itemFrequencyPlot(basket_arules, topN = 20)itemFrequencyPlot(basket_arules, topN = 20, type = "absolute")The stock codes do not mean a huge amount to us, so we also want to look at the description field for these items.
freq_codes <- itemFrequency(basket_arules) %>%
sort(decreasing = TRUE) %>%
head(20) %>%
names()
tnx_purchase_tbl %>%
select(stock_code, description) %>%
filter(stock_code %in% freq_codes) %>%
distinct() %>%
drop_na(description) %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
desc = str_c(description, collapse = " : ")
) %>%
arrange(stock_code) %>%
datatable()2.1 Basic Concepts
The basic ideas of association rule mining and basket analysis draws on basic ideas from probability theory.
We speak in terms of the itemset: that is, a collection of one or more items that co-occur in a transaction.
For example, suppose we have a list of transactions as follows:
| ID | Items |
|---|---|
| 1 | milk, bread |
| 2 | bread, butter |
| 3 | beer |
| 4 | milk, bread, butter |
| 5 | bread, butter |
Using the above set of transactions, and itemset may be “milk” or “bread, butter”.
The support of an itemset \(X\), \(\text{Supp}(X)\), is defined as the proportion of transactions in the dataset which contain the itemset.
In the above example:
\[ \text{Supp}(\text{\{milk, bread\}}) = \frac{2}{5} = 0.40. \]
A rule, \(X \Rightarrow Y\), between two itemsets \(X\) and \(Y\) is a directed relationship of the itemset \(X\) showing the presence of \(Y\). The rule is not symmetric: \(X \Rightarrow Y\) and \(Y \Rightarrow X\) are not the same.
The confidence for the rule \(X \implies Y\), \(\text{Conf}(X \Rightarrow Y)\) is defined by
\[ \text{Conf}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)}. \]
So, to calculate the confidence for a rule:
\[ \text{Conf}(\text{\{milk, bread\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{0.4} = 0.5. \]
To illustrate how rules are not symmetric:
\[ \text{Conf}(\text{\{butter\}} \Rightarrow \text{\{milk, bread\}}) = \frac{0.2}{0.6} = 0.33. \]
Finally, we want a measure of the strength of the relationship between the itemsets \(X\) and \(Y\). That is, measuring the effect of the presence of \(X\) on the presence of \(Y\). We measure this by defining the lift of a rule as
\[ \text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X) \text{Supp}(Y)}. \]
Again, we repeat our calculations for our rule.
\[ \text{Lift}(\text{\{bread, milk\}} \Rightarrow \text{\{butter\}}) = \frac{0.2}{(0.4)(0.6)} = \frac{0.2}{0.24} = 0.8333 \]
Lift values greater than 1 implies the presence of \(X\) increases the probability of \(Y\) being present when compared to the unconditional probability.
Now that we have these metrics and concepts, we can turn our attention to trying to find rules in a given dataset, using these metrics to rank them.
Rather than using brute-force approaches to discovering these rules, we use a number of different algorithms to find associations within the dataset.
The two main algorithms for discovering some rules are the apriori and the
eclat algorithms.
2.2 Construct apriori Rules
We now want to construct the association rules using the apriori algorithm.
To do this, we need to set parameters such as the minimum support and the
minimum confidence level.
This gives us a set of association rules, along with the support and lift.
basket_apriori <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.8)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 207
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.28s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.19s].
## writing ... [423 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
basket_apriori_tbl <- basket_apriori %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_apriori_tbl %>% glimpse()## Rows: 423
## Columns: 6
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "{2…
## $ support <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.00560…
## $ confidence <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.92…
## $ coverage <dbl> 0.005433470, 0.005409321, 0.005312726, 0.005385173, 0.00601…
## $ lift <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.1064,…
## $ count <int> 213, 213, 209, 209, 232, 232, 213, 224, 224, 225, 225, 209,…
We now want to inspect this table using the ruleExplorer()
basket_apriori %>% ruleExplorer()To help visualise these rules, we can produce a basic scatterplot of the metrics.
ggplot(basket_apriori_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")2.3 Construct eclat Rules
An alternative method of constructing association rules is to use the eclat
algorithm. The code for doing this is slightly different, but gives us similar
outputs.
basket_eclat <- eclat(
basket_arules,
parameter = list(support = 0.005)
) %>%
ruleInduction(
basket_arules,
confidence = 0.8
)## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 207
##
## create itemset ...
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.27s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating sparse bit matrix ... [1398 row(s), 41410 column(s)] done [0.01s].
## writing ... [6766 set(s)] done [3.13s].
## Creating S4 object ... done [0.00s].
basket_eclat_tbl <- basket_eclat %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))
basket_eclat_tbl %>% glimpse()## Rows: 423
## Columns: 5
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "{2…
## $ support <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.00560…
## $ confidence <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.92…
## $ lift <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.1064,…
## $ itemset <int> 52, 52, 56, 56, 59, 59, 52, 54, 54, 53, 53, 56, 34, 34, 55,…
Once again, we inspect the data using ruleExplorer()
basket_eclat %>% ruleExplorer()2.4 Compare Algorithms
We now want to compare the outputs of both algorithms in terms of association rules and how they compare.
basket_ap_tbl <- basket_apriori_tbl %>%
select(rules, support, confidence, lift)
basket_ec_tbl <- basket_eclat_tbl %>%
select(rules, support, confidence, lift)
rules_comparison_tbl <- basket_ap_tbl %>%
full_join(basket_ec_tbl, by = "rules", suffix = c("_a", "_e"))
rules_comparison_tbl %>% glimpse()## Rows: 423
## Columns: 7
## $ rules <chr> "{22917,22918} => {22916}", "{22916,22918} => {22917}", "…
## $ support_a <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.005…
## $ confidence_a <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.…
## $ lift_a <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.106…
## $ support_e <dbl> 0.005143685, 0.005143685, 0.005047090, 0.005047090, 0.005…
## $ confidence_e <dbl> 0.9466667, 0.9508929, 0.9500000, 0.9372197, 0.9317269, 0.…
## $ lift_e <dbl> 157.4356, 156.2558, 156.1091, 155.8645, 153.1064, 153.106…
2.5 Reducing Minimum Confidence
While high confidence rules are useful, they are more likely to find rules that are “obvious” as the probabilities are such that co-occuring basket items will be noticed as being together - or possibly be natural complements: butter, milk and bread is a good example.
Instead, we are also interested in less obvious rules, and so we reduce our confidence threshold and see how many additional rules are discovered.
basket_lower_rules <- apriori(
basket_arules,
parameter = list(supp = 0.005, conf = 0.4)
)## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.4 0.1 1 none FALSE TRUE 5 0.005 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 207
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4965 item(s), 41410 transaction(s)] done [0.27s].
## sorting and recoding items ... [1398 item(s)] done [0.02s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.19s].
## writing ... [4991 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
basket_lower_rules_tbl <- basket_lower_rules %>%
as("data.frame") %>%
as_tibble() %>%
arrange(desc(lift))ggplot(basket_lower_rules_tbl) +
geom_point(aes(x = confidence, y = lift), alpha = 0.2) +
xlab("Rule Confidence") +
ylab("Rule Lift") +
ggtitle("Scatterplot of Association Rule Metrics")3 Converting Rules to Graphs
We also have the ability to convert these rules to a graph representation,
where each node is either a stock_code or a rule, with the edges of the
graph representing that item being contained in the rule.
apriori_rules_igraph <- basket_apriori %>%
plot(
measure = "support",
method = "graph",
control = list(max = 1000)
) %>%
as("igraph")apriori_rules_igraph %>% print()## IGRAPH 958f098 DN-- 556 1571 --
## + attr: name (v/c), label (v/c), support (v/n), confidence (v/n),
## | coverage (v/n), lift (v/n), count (v/n), order (v/n)
## + edges from 958f098 (vertex names):
## [1] 1749->assoc1 1270->assoc2 2135->assoc3 2134->assoc4 2135->assoc5
## [6] 2136->assoc6 2135->assoc7 2132->assoc8 2135->assoc9 2133->assoc10
## [11] 2134->assoc11 2136->assoc12 2134->assoc13 2132->assoc14 2134->assoc15
## [16] 2133->assoc16 2136->assoc17 2132->assoc18 2136->assoc19 2133->assoc20
## [21] 2132->assoc21 2133->assoc22 2378->assoc23 2378->assoc24 1967->assoc25
## [26] 1967->assoc26 1967->assoc27 2377->assoc28 1966->assoc29 1966->assoc30
## [31] 1125->assoc31 1919->assoc32 1747->assoc33 1749->assoc33 1748->assoc34
## + ... omitted several edges
We should first visualise this graph, using the top 30 rules in the dataset, as measured by the support of the rule.
basket_apriori %>%
head(n = 30, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)3.1 Extract Connected Product Labels
First we want to look at the different disjoint components of the graph, and label them with an ID.
apriori_rules_tblgraph <- apriori_rules_igraph %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()We then want to create groups of common products that form a disjoint cluster within this graph.
product_groups_all_tbl <- apriori_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = component_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_all_tbl %>% glimpse()## Rows: 133
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 6…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723",…
For display purposes, we can show all the stock_id values in a list.
3.1.1 Cluster Larger Groups
Within the large disjoint cluster there are a large number of products so rather than treating this as a single group we instead may investigate using further graph clustering algorithms to create further groupings.
apriori_rules_large_tblgraph <- apriori_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = group_edge_betweenness()
) %>%
unmorph()Now that we have sub-divided this large subgraph, we repeat the process.
product_groups_largest_tbl <- apriori_rules_large_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = sub_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_largest_tbl %>% glimpse()## Rows: 61
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,…
## $ product_count <int> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ stock_code <chr> "20719", "20723", "20724", "20725", "20726", "20727",…
Finally, it is worth trying to use an interactive tool to investigate this
subgraph, we we can use visNetwork() to produce an interactive JS tool
apriori_rules_large_tblgraph %>%
toVisNetworkData(idToLabel = FALSE) %>%
visNetwork(
nodes = .$nodes %>% transmute(id, label, group = sub_id),
edges = .$edges
)3.2 Evaluating Product Groups
How do we go about assessing the validity of these product groups?
Note that this work is exploratory - in effect this is more sophisticated data exploration. Rather than use this model to make predictions - a job we will need to do at some point, we instead just want to assess how novel these grouping are.
To that end, it may be useful to check the co-occurrence of these products as a group - for each purchase made by a customer, what proportion of the group was featured in this data?
This question is worth exploring, so we should write some code to assess this.
Before we do this, we combine our two lists of product groups into a single table.
stock_groups_tbl <- list(
ALL = product_groups_all_tbl,
LRG = product_groups_largest_tbl
) %>%
bind_rows(.id = "type") %>%
mutate(
group_label = sprintf("%s_%02d", type, product_group_id)
) %>%
group_by(group_label) %>%
mutate(
group_size = n()
) %>%
ungroup() %>%
select(group_label, group_size, stock_code)
stock_groups_tbl %>% glimpse()## Rows: 194
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size <int> 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61, 61…
## $ stock_code <chr> "20711", "20712", "20713", "20718", "20719", "20723", "207…
tnx_groups_tbl <- tnx_data_tbl %>%
select(invoice_id, invoice_date, stock_code) %>%
group_nest(invoice_id, .key = "invoice_data")
arules_groups_tbl <- stock_groups_tbl %>%
group_nest(group_label, group_size, .key = "stock_data")
group_props_tbl <- arules_groups_tbl %>%
filter(group_size > 1, group_size < 15) %>%
expand_grid(tnx_groups_tbl) %>%
mutate(
comb_data = future_map2(
invoice_data, stock_data,
inner_join,
by = "stock_code",
.options = furrr_options(globals = FALSE)
),
match_count = map_int(comb_data, nrow),
group_prop = match_count / group_size
) %>%
select(group_label, group_size, group_prop) %>%
filter(group_prop > 0)
group_props_tbl %>% glimpse()## Rows: 63,177
## Columns: 3
## $ group_label <chr> "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02", "ALL_02"…
## $ group_size <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ group_prop <dbl> 0.2, 0.6, 0.3, 0.3, 0.1, 0.7, 0.2, 0.4, 0.2, 0.4, 0.1, 0.2…
We now create a histogram of the proportions for each group, and this gives us a gauge of the ‘novelty’ of each of these groups.
plot_tbl <- group_props_tbl %>%
mutate(label = glue("{group_label} ({group_size})"))
ggplot(plot_tbl) +
geom_histogram(aes(x = group_prop), binwidth = 0.1) +
facet_wrap(vars(label), scales = "free_y") +
scale_y_continuous(labels = label_comma()) +
xlab("Proportion") +
ylab("Purchase Count") +
ggtitle("Facetted Histograms of Group Coverages by Product Grouping") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))3.2.1 Investigate Groups
Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.
stock_groups_tbl %>%
filter(group_size > 1, group_size < 15) %>%
mutate(stock_code_upr = stock_code %>% str_trim() %>% str_to_upper()) %>%
left_join(stock_code_lookups_tbl, by = "stock_code_upr") %>%
datatable()3.2.2 Write Product Groups
As this may be useful for later analysis and for later modelling, we output these groupings for later use.
stock_groups_tbl %>% write_rds("data/stock_groups_tbl.rds")4 Investigate Lower Support Rules
Our previous analysis was all based on rules with a minimum confidence of 0.80 so we now want to repeat our analysis but on this new set of rules.
apriori_lower_rules_igraph <- basket_lower_rules %>%
plot(
measure = "support",
method = "graph",
control = list(max = 5000)
) %>%
as("igraph")apriori_lower_rules_igraph %>% glimpse()## List of 10
## $ :List of 1
## ..$ 25: 'igraph.vs' Named int [1:4] 678 689 1384 1386
## .. ..- attr(*, "names")= chr [1:4] "assoc219" "assoc230" "assoc925" "assoc927"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 27: 'igraph.vs' Named int [1:3] 690 1385 1386
## .. ..- attr(*, "names")= chr [1:3] "assoc231" "assoc926" "assoc927"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 29: 'igraph.vs' Named int 467
## .. ..- attr(*, "names")= chr "assoc8"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 218: 'igraph.vs' Named int [1:9] 639 641 642 1366 1367 1369 1370 1372 1373
## .. ..- attr(*, "names")= chr [1:9] "assoc180" "assoc182" "assoc183" "assoc907" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 219: 'igraph.vs' Named int [1:42] 516 640 796 798 800 802 803 1299 1301 1341 ...
## .. ..- attr(*, "names")= chr [1:42] "assoc57" "assoc181" "assoc337" "assoc339" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 220: 'igraph.vs' Named int [1:23] 804 963 979 1370 1371 1373 1374 1400 1401 1472 ...
## .. ..- attr(*, "names")= chr [1:23] "assoc345" "assoc504" "assoc520" "assoc911" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 221: 'igraph.vs' Named int [1:42] 607 801 855 913 962 1342 1343 1348 1349 1354 ...
## .. ..- attr(*, "names")= chr [1:42] "assoc148" "assoc342" "assoc396" "assoc454" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 223: 'igraph.vs' Named int [1:4] 677 679 1384 1385
## .. ..- attr(*, "names")= chr [1:4] "assoc218" "assoc220" "assoc925" "assoc926"
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 249: 'igraph.vs' Named int [1:90] 1056 1057 1058 1059 1800 1801 1802 1803 1804 1805 ...
## .. ..- attr(*, "names")= chr [1:90] "assoc597" "assoc598" "assoc599" "assoc600" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## $ :List of 1
## ..$ 250: 'igraph.vs' Named int [1:152] 1201 1202 1203 1823 1824 1833 1835 1836 1838 1839 ...
## .. ..- attr(*, "names")= chr [1:152] "assoc742" "assoc743" "assoc744" "assoc1364" ...
## .. ..- attr(*, "env")=<weakref>
## .. ..- attr(*, "graph")= chr "5e774f1b-4607-46fa-874b-75f1f1e13d77"
## - attr(*, "class")= chr "igraph"
Once again we have a quick look at the top 50 rules.
basket_lower_rules %>%
head(n = 50, by = "support") %>%
plot(
measure = "lift",
method = "graph",
engine = "htmlwidget"
)4.1 Determine Distinct Rules Subgraphs
Having converted the association rules to the graph, we then look at the distinct components of this graph and use these as our first pass at these clusters.
apriori_lower_rules_tblgraph <- apriori_lower_rules_igraph %>%
as_tbl_graph() %>%
mutate(
component_id = group_components()
) %>%
group_by(component_id) %>%
mutate(
component_size = n()
) %>%
ungroup()
apriori_lower_rules_tblgraph %>% print()## # A tbl_graph: 5450 nodes and 15768 edges
## #
## # A directed simple graph with 88 components
## #
## # Node Data: 5,450 x 10 (active)
## name label support confidence coverage lift count order component_id
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
## 1 25 1505… NA NA NA NA NA NA 14
## 2 27 1505… NA NA NA NA NA NA 14
## 3 29 1505… NA NA NA NA NA NA 14
## 4 218 20674 NA NA NA NA NA NA 2
## 5 219 20675 NA NA NA NA NA NA 2
## 6 220 20676 NA NA NA NA NA NA 2
## # … with 5,444 more rows, and 1 more variable: component_size <int>
## #
## # Edge Data: 15,768 x 2
## from to
## <int> <int>
## 1 375 460
## 2 376 461
## 3 78 462
## # … with 15,765 more rows
product_groups_lower_all_tbl <- apriori_lower_rules_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(component_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = component_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_lower_all_tbl %>% glimpse()## Rows: 459
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "20719",…
apriori_lower_rules_bigcomp_tblgraph <- apriori_lower_rules_tblgraph %>%
to_subgraph(component_size == max(component_size)) %>%
use_series(subgraph) %>%
morph(to_undirected) %>%
mutate(
sub_id = group_edge_betweenness()
) %>%
unmorph()product_groups_lower_bigcomp_tbl <- apriori_lower_rules_bigcomp_tblgraph %>%
activate(nodes) %>%
as_tibble() %>%
filter(are_na(support)) %>%
group_by(sub_id) %>%
mutate(
product_count = n()
) %>%
ungroup() %>%
select(product_group_id = sub_id, product_count, stock_code = label) %>%
arrange(product_group_id, stock_code)
product_groups_lower_bigcomp_tbl %>% glimpse()## Rows: 214
## Columns: 3
## $ product_group_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ product_count <int> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 1…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "21928",…
stock_groups_lower_tbl <- list(
ALL = product_groups_lower_all_tbl,
LRG = product_groups_lower_bigcomp_tbl
) %>%
bind_rows(.id = "type") %>%
mutate(
group_label = sprintf("%s_%02d", type, product_group_id)
) %>%
group_by(group_label) %>%
mutate(
group_size = n()
) %>%
ungroup() %>%
select(group_label, group_size, stock_code)
stock_groups_lower_tbl %>% glimpse()## Rows: 673
## Columns: 3
## $ group_label <chr> "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01", "ALL_01"…
## $ group_size <int> 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214, 214…
## $ stock_code <chr> "20711", "20712", "20713", "20717", "20718", "20719", "207…
Now that we have our groups we add on the description fields so that interpretation of the different groupings is easier.
stock_groups_lower_tbl %>%
filter(group_size > 1, group_size != max(group_size)) %>%
mutate(stock_code_upr = stock_code %>% str_trim() %>% str_to_upper()) %>%
left_join(stock_code_lookups_tbl, by = "stock_code_upr") %>%
datatable()5 R Environment
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.4 (2021-02-15)
## os Ubuntu 20.04.2 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-05-24
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## arules * 1.6-7 2021-03-16 [1] RSPM (R 4.0.4)
## arulesViz * 1.4-0 2021-03-07 [1] RSPM (R 4.0.3)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.3)
## backports 1.2.1 2020-12-09 [1] RSPM (R 4.0.3)
## bookdown 0.21 2020-10-13 [1] RSPM (R 4.0.2)
## broom 0.7.5 2021-02-19 [1] RSPM (R 4.0.3)
## bslib 0.2.4 2021-01-25 [1] RSPM (R 4.0.3)
## cachem 1.0.4 2021-02-13 [1] RSPM (R 4.0.3)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.0.3)
## cli 2.3.1 2021-02-23 [1] RSPM (R 4.0.3)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.0.4)
## colorspace 2.0-0 2020-11-11 [1] RSPM (R 4.0.3)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.0.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.0.3)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.0.3)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.0.3)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.0.3)
## dbplyr 2.1.0 2021-02-03 [1] RSPM (R 4.0.3)
## digest 0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
## dplyr * 1.0.5 2021-03-05 [1] RSPM (R 4.0.3)
## DT * 0.17 2021-01-06 [1] RSPM (R 4.0.3)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.3)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.3)
## fansi 0.4.2 2021-01-15 [1] RSPM (R 4.0.3)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.0.3)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.0.3)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.0.3)
## foreach 1.5.1 2020-10-15 [1] RSPM (R 4.0.3)
## fs 1.5.0 2020-07-31 [1] RSPM (R 4.0.3)
## furrr * 0.2.2 2021-01-29 [1] RSPM (R 4.0.3)
## future * 1.21.0 2020-12-10 [1] RSPM (R 4.0.3)
## generics 0.1.0 2020-10-31 [1] RSPM (R 4.0.3)
## ggplot2 * 3.3.3 2020-12-30 [1] RSPM (R 4.0.3)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.0.3)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.0.3)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.3)
## haven 2.3.1 2020-06-01 [1] RSPM (R 4.0.3)
## highr 0.8 2019-03-20 [1] RSPM (R 4.0.3)
## hms 1.0.0 2021-01-13 [1] RSPM (R 4.0.3)
## htmltools 0.5.1.1 2021-01-22 [1] RSPM (R 4.0.3)
## htmlwidgets 1.5.3 2020-12-10 [1] RSPM (R 4.0.3)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.3)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.3)
## iterators 1.0.13 2020-10-15 [1] RSPM (R 4.0.3)
## jquerylib 0.1.3 2020-12-17 [1] RSPM (R 4.0.3)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.0.3)
## knitr 1.31 2021-01-27 [1] RSPM (R 4.0.3)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.0.3)
## lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.4)
## lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.0.3)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.0.3)
## lubridate 1.7.10 2021-02-26 [1] RSPM (R 4.0.3)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.0.3)
## Matrix * 1.3-2 2021-01-06 [2] CRAN (R 4.0.4)
## memoise 2.0.0 2021-01-26 [1] RSPM (R 4.0.3)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.0.3)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.3)
## parallelly 1.24.0 2021-03-14 [1] RSPM (R 4.0.3)
## pillar 1.5.1 2021-03-05 [1] RSPM (R 4.0.3)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.3)
## ps 1.6.0 2021-02-28 [1] RSPM (R 4.0.3)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.0.3)
## R6 2.5.0 2020-10-28 [1] RSPM (R 4.0.3)
## Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.0.3)
## readr * 1.4.0 2020-10-05 [1] RSPM (R 4.0.4)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.0.3)
## registry 0.5-1 2019-03-05 [1] RSPM (R 4.0.0)
## reprex 1.0.0 2021-01-27 [1] RSPM (R 4.0.3)
## rlang * 0.4.10 2020-12-30 [1] RSPM (R 4.0.3)
## rmarkdown 2.7 2021-02-19 [1] RSPM (R 4.0.3)
## rmdformats 1.0.1 2021-01-13 [1] RSPM (R 4.0.3)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.0.3)
## rvest 1.0.0 2021-03-09 [1] RSPM (R 4.0.3)
## sass 0.3.1 2021-01-24 [1] RSPM (R 4.0.3)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.0.3)
## seriation 1.2-9 2020-10-01 [1] RSPM (R 4.0.2)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.3)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.3)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.0.3)
## tibble * 3.1.0 2021-02-25 [1] RSPM (R 4.0.3)
## tidygraph * 1.2.0 2020-05-12 [1] RSPM (R 4.0.3)
## tidyr * 1.1.3 2021-03-03 [1] RSPM (R 4.0.4)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.3)
## tidyverse * 1.3.0 2019-11-21 [1] RSPM (R 4.0.3)
## TSP 1.1-10 2020-04-17 [1] RSPM (R 4.0.0)
## utf8 1.2.1 2021-03-12 [1] RSPM (R 4.0.3)
## vctrs 0.3.7 2021-03-29 [1] RSPM (R 4.0.4)
## visNetwork 2.0.9 2019-12-06 [1] RSPM (R 4.0.3)
## withr 2.4.1 2021-01-26 [1] RSPM (R 4.0.3)
## xfun 0.22 2021-03-11 [1] RSPM (R 4.0.3)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.0.3)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.3)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library